13 research outputs found

    An analyser and generator for Irish inflectional morphology using finite-state transducers

    Computational morphology is an important step in natural language processing. Finite-state techniques have been applied successfully in computational phonology and morphology to many of the world’s major languages. Celtic languages, such as Modern Irish, present unique and challenging morphological features that to date have not been addressed using finite-state technology. This thesis presents a finite-state morphology of Irish developed using Xerox Finite-State Tools. To the best of our knowledge, no such resource previously existed. The computational model, implemented as a finite-state transducer, encodes the inflectional morphology of nouns, adjectives, and verbs. Other parts of speech are also included in the interests of language coverage. The implementation is a strictly lexicalised design: the morphotactics of stems and affixes are encoded in the lexicon using replace rule triggers. Word mutations are then implemented as a series of replace rules written as regular expressions. Both components are compiled into finite-state transducers and then combined to produce a single two-level morphological transducer for the language. A major advantage of finite-state implementations of morphology is their inherent bi-directionality; the same system is used for both analysis and generation of word forms in the language. This resource can be used as a component in parsing and generation in natural language processing (NLP) applications, such as spelling checkers/correctors, stemmers and text-to-speech synthesisers. It can also be used for tokenising text, lemmatising, and as an input to automatic part-of-speech tagging of a corpus. The system is designed for broad coverage of the language, and this is evaluated by comparing it with a list of the 1000 most frequently found word forms in a corpus of contemporary Irish texts. Finally, maintainability of the system is discussed and possible extensions to the system are suggested, such as derivational morphology and the inclusion of dialectal or historical word-forms.
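The lexicon-plus-replace-rules design described in this abstract can be illustrated with a minimal Python sketch. It is not the thesis's xfst implementation: the trigger symbols `^Len` and `^Ecl`, the tag strings, and the three-entry toy lexicon are all hypothetical, and only initial lenition and eclipsis of simple consonants are covered. The sketch shows a trigger placed in the lexicon being rewritten by a regular-expression rule, and the same mapping being used for both generation and analysis.

```python
import re

# Hypothetical trigger symbols; the abstract does not name the real ones.
LEN, ECL = "^Len", "^Ecl"

# Eclipsis of initial consonants in Irish orthography (vowel-initial words omitted).
ECLIPSIS = {"b": "mb", "c": "gc", "d": "nd", "f": "bhf", "g": "ng", "p": "bp", "t": "dt"}

def apply_mutations(lexical: str) -> str:
    """Rewrite a lexicon string carrying a trigger into its surface form."""
    # Lenition: insert 'h' after a lenitable initial consonant.
    s = re.sub(r"^\^Len([bcdfgmpst])", r"\1h", lexical)
    # Eclipsis: replace the initial consonant with its eclipsed spelling.
    s = re.sub(r"^\^Ecl([bcdfgpt])", lambda m: ECLIPSIS[m.group(1)], s)
    # Drop any trigger whose rule did not apply.
    return s.replace(LEN, "").replace(ECL, "")

# Toy lexicon: (lemma, tags) -> lexical string with an optional trigger.
LEXICON = {
    ("bean", "+Noun"): "bean",
    ("bean", "+Noun+Len"): LEN + "bean",
    ("cat", "+Noun+Ecl"): ECL + "cat",
}

def generate(lemma: str, tags: str) -> str:
    """Generation direction: analysis string -> surface form."""
    return apply_mutations(LEXICON[(lemma, tags)])

# Analysis direction: invert the same mapping rather than writing new rules.
ANALYSES = {}
for key in LEXICON:
    ANALYSES.setdefault(generate(*key), []).append(key)

def analyse(surface: str) -> list:
    """Return every (lemma, tags) pair that generates the given surface form."""
    return ANALYSES.get(surface, [])
```

A real transducer gets the inverse direction from the formalism itself; the explicit inverted table here only imitates that property for a finite toy lexicon.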

    Partial dependency parsing for Irish

    In this paper we present a partial dependency parser for Irish, in which Constraint Grammar (CG) rules are used to annotate dependency relations and grammatical functions in unrestricted Irish text. Chunking is performed using a regular-expression grammar which operates on the dependency-tagged sentences. As this is the first implementation of a parser for unrestricted Irish text (to our knowledge), there were no guidelines or precedents available. Deciding what constitutes a syntactic unit, and how it should be annotated, therefore accounted for a major part of the early development effort. Currently, all tokens in a sentence are tagged for grammatical function and local dependency. Long-distance dependencies, prepositional attachments and coordination are not handled, resulting in a partial dependency analysis. Evaluations show that the partial dependency analysis achieves an F-score of 93.60% on development data and 94.28% on unseen test data, while the chunker achieves an F-score of 97.20% on development data and 93.50% on unseen test data.
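For readers unfamiliar with the F-score figures quoted above, the following is a minimal sketch of how such a score is usually computed from counts of correct, spurious and missed annotations. The function name and the example counts are illustrative assumptions; the abstract does not give the underlying counts or state which variant of the measure was used.

```python
def f_score(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-measure from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    # With beta = 1 this is the harmonic mean of precision and recall.
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative counts only: 90 correct labels, 10 spurious, 10 missed.
example = f_score(tp=90, fp=10, fn=10)
```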

    A Part-of-Speech tagger for Irish using finite state morphology and constraint grammar disambiguation

    This paper describes the methodology used to develop a part-of-speech tagger for Irish, which is used to annotate a corpus of 30 million words of text with part-of-speech tags and lemmas. The tagger is evaluated using a manually disambiguated test corpus and currently achieves 95% accuracy on unrestricted text. To our knowledge, this is the first part-of-speech tagger for Irish.

    Active learning and the Irish treebank

    We report on our ongoing work in developing the Irish Dependency Treebank, describe the results of two Inter-Annotator Agreement (IAA) studies, demonstrate improvements in annotation consistency which have a knock-on effect on parsing accuracy, and present the final set of dependency labels. We then investigate the extent to which active learning can play a role in treebank and parser development by comparing an active learning bootstrapping approach to a passive approach in which sentences are chosen at random for manual revision. We show that active learning outperforms passive learning, but when annotation effort is taken into account, it is not clear how much of an advantage the active learning approach has. Finally, we present results which suggest that adding automatic parses to the training data along with manually revised parses in an active learning setup does not greatly affect parsing accuracy.
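The contrast between active and passive selection described above can be sketched as follows. The abstract does not say how the paper ranks sentences, so the per-sentence confidence score used here is an assumption, as are the function names and the example data; only the random baseline is taken directly from the text.

```python
import random

def select_active(confidence: dict, k: int) -> list:
    """Pick the k sentences the parser is least confident about (assumed criterion)."""
    return sorted(confidence, key=lambda sid: confidence[sid])[:k]

def select_passive(sentence_ids, k: int, seed: int = 0) -> list:
    """Pick k sentences at random for manual revision (the passive baseline)."""
    return random.Random(seed).sample(sorted(sentence_ids), k)

# Hypothetical confidence scores for four unannotated sentences.
pool = {"s1": 0.9, "s2": 0.4, "s3": 0.7, "s4": 0.2}
to_revise_active = select_active(pool, 2)
to_revise_passive = select_passive(pool, 2)
```

In a bootstrapping loop, the selected sentences would be manually corrected, added to the training data, and the parser retrained before the next round of selection.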

    Quizzes on tap: exporting a test generation system from one less resourced language to another

    It is difficult to develop and deploy Language Technology and applications for minority languages for many reasons. These include the lack of Natural Language Processing (NLP) resources for the language, a scarcity of NLP researchers who speak the language, and the communication gap between teachers in the classroom and researchers working in universities and other centres of research. One approach to overcoming these obstacles is for researchers interested in Less-Resourced Languages (LRLs) to work together in reusing and adapting existing resources where possible. This article outlines how a multiple-choice quiz generator for Basque was adapted for Irish. The Quizzes on Tap (QOT) system uses Latent Semantic Analysis (LSA) to automatically generate multiple-choice test items. Adapting the Basque application to work for Irish involved the sourcing of suitable Irish corpora and a morphological engine for Irish, as well as the compilation of a development set. Various integration issues arising from differences between Basque and Irish needed to be dealt with. The QOT system provides a useful resource that enables Irish teachers to produce both domain-specific and general-knowledge quizzes in a timely manner, for children with varying levels of exposure to the language. Keywords: LRL, less-resourced languages, Irish, morphological analysis, multiple choice tes

    Irish treebanking and parsing: a preliminary evaluation

    Language resources are essential for linguistic research and the development of NLP applications. Low-density languages, such as Irish, therefore lack significant research in this area. This paper describes the early stages in the development of new language resources for Irish, namely the first Irish dependency treebank and the first Irish statistical dependency parser. We present the methodology behind building our new treebank and the steps we take to leverage the few existing resources. We discuss language-specific choices made when defining our dependency labelling scheme, and describe interesting Irish language characteristics such as prepositional attachment, copula and clefting. We manually develop a small treebank of 300 sentences based on an existing POS-tagged corpus and report an inter-annotator agreement of 0.7902. We train MaltParser to achieve preliminary parsing results for Irish and describe a bootstrapping approach for further stages of development.

    International Comparable Corpus: Challenges in building multilingual spoken and written comparable corpora

    This paper reports on the efforts of twelve national teams in building the International Comparable Corpus (ICC; https://korpus.cz/icc) that will contain highly comparable datasets of spoken, written and electronic registers. The languages currently covered are Czech, Finnish, French, German, Irish, Italian, Norwegian, Polish, Slovak, Swedish and, more recently, Chinese, as well as English, which is considered to be the pivot language. The goal of the project is to provide much-needed data for contrastive corpus-based linguistics. The ICC corpus is committed to the idea of re-using existing multilingual resources as much as possible and the design is modelled, with various adjustments, on the International Corpus of English (ICE). As such, ICC will contain approximately the same balance of 40 percent written language and 60 percent spoken language, distributed across 27 different text types and contexts. A number of issues encountered by the project teams are discussed, ranging from copyright and data sustainability to technical advances in data distribution. Peer reviewed